Q-learning with Nearest Neighbors
We consider model-free reinforcement learning for infinite-horizon discounted
Markov Decision Processes (MDPs) with a continuous state space and unknown
transition kernel, when only a single sample path under an arbitrary policy of
the system is available. We consider the Nearest Neighbor Q-Learning (NNQL)
algorithm to learn the optimal Q function using a nearest neighbor regression
method. As the main contribution, we provide a tight finite-sample analysis of
the convergence rate. In particular, for MDPs with a $d$-dimensional state
space and discount factor $\gamma \in (0,1)$, given an arbitrary sample
path with "covering time" $L$, we establish that the algorithm is guaranteed
to output an $\epsilon$-accurate estimate of the optimal Q-function using
$\tilde{O}\big(L/(\epsilon^3(1-\gamma)^7)\big)$ samples. For instance, for a
well-behaved MDP, the covering time of the sample path under the purely random
policy scales as $\tilde{O}(1/\epsilon^d)$, so the sample
complexity scales as $\tilde{O}(1/\epsilon^{d+3})$. Indeed, we
establish a lower bound that argues that the dependence $\tilde{\Omega}(1/\epsilon^{d+2})$ is necessary.
Comment: Accepted to NIPS 2018
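To make the nearest-neighbor update concrete, here is a minimal Python sketch of a single NNQL-style step, assuming a fixed set of anchor states covering $[0,1]^d$. It updates only the anchor nearest to the observed state, whereas the paper's NNQL aggregates updates over neighborhoods, so treat this as an illustration rather than the paper's exact algorithm.

```python
import numpy as np

# Minimal sketch of a nearest-neighbor Q-learning step (illustrative; the
# paper's NNQL averages over neighborhoods rather than a single anchor).
# Anchors form a covering of the continuous state space [0, 1]^d.

def nearest(anchors, s):
    """Index of the anchor point closest to state s."""
    return int(np.argmin(np.linalg.norm(anchors - s, axis=1)))

def nnql_update(Q, anchors, s, a, r, s_next, gamma=0.9, lr=0.1):
    """One Q-learning step applied at the anchor nearest to s."""
    i, j = nearest(anchors, s), nearest(anchors, s_next)
    target = r + gamma * Q[j].max()        # Bellman backup via neighbor j
    Q[i, a] += lr * (target - Q[i, a])     # stochastic approximation step
    return Q

# Usage on a synthetic transition (d = 2, 3 actions, 25 anchors):
rng = np.random.default_rng(0)
anchors = rng.uniform(size=(25, 2))
Q = np.zeros((25, 3))
Q = nnql_update(Q, anchors, s=np.array([0.2, 0.7]), a=1, r=1.0,
                s_next=np.array([0.3, 0.6]))
```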
Greed Works -- Online Algorithms For Unrelated Machine Stochastic Scheduling
This paper establishes performance guarantees for online algorithms that
schedule stochastic, nonpreemptive jobs on unrelated machines to minimize the
expected total weighted completion time. Prior work on unrelated machine
scheduling with stochastic jobs was restricted to the offline case, and
required linear or convex programming relaxations for the assignment of jobs to
machines. The algorithms introduced in this paper are purely combinatorial. The
performance bounds are of the same order of magnitude as those of earlier work,
and depend linearly on an upper bound on the squared coefficient of variation
of the jobs' processing times. Specifically for deterministic processing times,
without and with release times, the competitive ratios are 4 and 7.216,
respectively. As to the technical contribution, the paper shows how dual
fitting techniques can be used for stochastic and nonpreemptive scheduling
problems.
Comment: Preliminary version appeared in IPCO 2017
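As an illustration of a purely combinatorial rule of this flavor, the following Python sketch assigns each arriving job to the machine where its marginal expected weighted completion time is smallest, with each machine serving its queue in WSEPT order (largest weight-to-expected-processing-time ratio first). The cost accounting is the standard marginal-increase calculation; the helper names and tie-breaking are illustrative rather than the paper's specification.

```python
# Greedy online assignment sketch for unrelated machines (illustrative).

def increase(queue, w_j, ep_j):
    """Marginal expected weighted completion time if job j joins `queue`.

    queue: list of (weight, expected processing time) pairs on one machine.
    Jobs ahead of j in WSEPT order delay j; jobs behind are delayed by j.
    """
    before = sum(ep for w, ep in queue if w / ep >= w_j / ep_j)
    after_w = sum(w for w, ep in queue if w / ep < w_j / ep_j)
    return w_j * (before + ep_j) + after_w * ep_j

def greedy_assign(machines, w_j, exp_proc):
    """machines: per-machine queues; exp_proc[i]: E[proc. time] on machine i."""
    i = min(range(len(machines)),
            key=lambda i: increase(machines[i], w_j, exp_proc[i]))
    machines[i].append((w_j, exp_proc[i]))
    return i

# Example: two machines, a job with weight 2 and machine-dependent means.
machines = [[(1.0, 3.0)], []]
print(greedy_assign(machines, w_j=2.0, exp_proc=[2.0, 5.0]))  # -> 0
```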
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes
We develop several provably efficient model-free reinforcement learning (RL)
algorithms for infinite-horizon average-reward Markov Decision Processes
(MDPs). We consider both the online setting and the setting with access to a
simulator. In the online setting, we propose model-free RL algorithms based on
reference-advantage decomposition. Our algorithm achieves
$\tilde{O}(S^5 A^2\,\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where
$S \times A$ is the size of the state-action space, and $\mathrm{sp}(h^*)$
the span of the optimal bias function. Our results are the
first to achieve the optimal dependence in $T$ for weakly communicating MDPs.
In the simulator setting, we propose a model-free RL algorithm that finds an
$\epsilon$-optimal policy using $\tilde{O}\big(SA\,\mathrm{sp}(h^*)^2/\epsilon^2\big)$ samples, whereas the minimax lower bound is
$\Omega\big(SA\,\mathrm{sp}(h^*)/\epsilon^2\big)$.
Our results are based on two new techniques that are unique in the
average-reward setting: 1) better discounted approximation by value-difference
estimation; 2) efficient construction of a confidence region for the optimal bias
function with space complexity $O(SA)$.
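To make technique 1) concrete at a high level: average-reward problems are commonly approximated by discounted ones whose effective horizon $1/(1-\gamma)$ is tied to $\mathrm{sp}(h^*)/\epsilon$. The Python sketch below illustrates only that horizon-matching step, using plain value iteration on a known model for brevity; the paper's algorithms are model-free, and the constant in discount_for_accuracy is illustrative.

```python
import numpy as np

# Illustrative reduction: pick gamma so the discounted problem approximates
# the average-reward problem to accuracy eps, then solve the discounted MDP.

def discount_for_accuracy(span_h, eps, c=1.0):
    """Choose gamma so the discounted bias error is O(eps); c is a stand-in."""
    return 1.0 - c * eps / span_h

def q_value_iteration(P, R, gamma, iters=500):
    """Plain discounted Q-value iteration on a known/estimated model.

    P: (S, A, S) transition tensor; R: (S, A) mean rewards.
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)   # Bellman optimality update
    return Q

# Example: random 3-state, 2-action MDP, accuracy 0.1, span guess 2.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.uniform(size=(3, 2))
Q = q_value_iteration(P, R, discount_for_accuracy(span_h=2.0, eps=0.1))
policy = Q.argmax(axis=1)
```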
Scheduling and resource allocation for clouds: novel algorithms, state space collapse and decay of tails
Scheduling and resource allocation in cloud systems is of fundamental importance to system efficiency. The focus of this thesis is to study the fundamental limits of the scheduling and resource allocation problems in clouds, and design provably high-performance algorithms.
In the first part, we consider data-centric scheduling. Data-intensive applications pose increasingly significant challenges to scheduling in today's computing clusters. The presence of data induces an extremely heterogeneous cluster where processing speed depends on the task-server pair. The situation is further complicated by ever-changing technologies of networking, memory, and software architecture. As a result, a suboptimal scheduling algorithm causes unnecessary delay in job completion and wastes system capacity. We propose a versatile model featuring a multi-class parallel-server system that readily incorporates different characteristics of a variety of systems. The model has been studied by Harrison, Williams, and Stolyar. However, delay optimality in heavy traffic with unknown arrival rate vectors has remained an open problem. We propose novel algorithms that achieve delay optimality with unknown arrival rates, which enables their application to data-centric clusters. New proof techniques are required, including the construction of an ideal load decomposition. To demonstrate the effectiveness of the proposed algorithms, we implement a Hadoop MapReduce scheduler and show that it achieves a more than tenfold improvement over existing schedulers.
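As a point of reference for the model (and explicitly not the thesis's delay-optimal algorithm), a classic MaxWeight-style rule for a multi-class parallel-server system can be sketched as below, where the rate matrix mu encodes task-server heterogeneity such as data locality.

```python
# Generic MaxWeight-style dispatch for a multi-class parallel-server system
# (illustrative baseline only, NOT the thesis's algorithm). mu[k][i] is the
# service rate of class k on server i, e.g. fast where the data resides.

def max_weight_schedule(queues, mu):
    """queues[k]: backlog of class k; returns a map server -> class to serve,
    picking for each server the class maximizing queue length x rate."""
    servers = range(len(mu[0]))
    return {i: max(range(len(queues)), key=lambda k: queues[k] * mu[k][i])
            for i in servers}

# Example: 2 classes, 2 servers; class 0's data is local to server 0.
queues = [10, 4]
mu = [[1.0, 0.2],   # class 0: fast on server 0
      [0.5, 1.0]]   # class 1: fast on server 1
print(max_weight_schedule(queues, mu))  # -> {0: 0, 1: 1}
```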
The second part studies the resource allocation problem for clouds that provide infrastructure as a service in the form of virtual machines (VMs). Consolidation of multiple VMs on a single physical machine (PM) has been advocated for improving system utilization. VMs placed on the same PM are subject to a resource "packing" constraint, leading to stochastic dynamic bin-packing models for the real-time assignment of VMs to PMs in a data center. Because the pool of servers is finite, incoming requests might not be fulfilled immediately, and such requests are typically rejected; hence a meaningful metric in practice is the blocking probability of arriving VM requests. We analyze the power-of-d-choices algorithm, a well-known stateless randomized routing policy with low scheduling overhead. We establish an explicit upper bound on the equilibrium blocking probability, and further demonstrate that the blocking probability exhibits distinct behaviors in different load regimes: doubly exponential decay in the heavy-traffic regime and exponential decay in the critically loaded regime.
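For concreteness, here is a minimal Python sketch of power-of-d-choices routing in this setting, assuming slot-style capacities; the variable names and the "most free capacity" tie-break are illustrative choices rather than a fixed specification from the abstract.

```python
import random

# Power-of-d-choices routing sketch: sample d physical machines uniformly
# at random, place the VM on the sampled PM with the most free capacity,
# and block the request if none of the d sampled PMs can fit it.

def route_vm(free_capacity, demand, d=2, rng=random):
    """free_capacity: remaining slots per PM. Returns a PM index or None."""
    sampled = rng.sample(range(len(free_capacity)), d)
    best = max(sampled, key=lambda i: free_capacity[i])
    if free_capacity[best] >= demand:
        free_capacity[best] -= demand    # place the VM
        return best
    return None                          # blocked: no sampled PM fits it

# Example: 6 PMs with slot capacities, one 1-slot VM request.
caps = [3, 0, 2, 1, 0, 4]
print(route_vm(caps, demand=1, d=2))
```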
RL-QN: A Reinforcement Learning Framework for Optimal Control of Queueing Systems
With the rapid advance of information technology, network systems have become
increasingly complex and hence the underlying system dynamics are often unknown
or difficult to characterize. Finding a good network control policy is of
significant importance to achieve desirable network performance (e.g., high
throughput or low delay). In this work, we consider using model-based
reinforcement learning (RL) to learn the optimal control policy for queueing
networks so that the average job delay (or equivalently the average queue
backlog) is minimized. Traditional approaches in RL, however, cannot handle the
unbounded state spaces of the network control problem. To overcome this
difficulty, we propose a new algorithm, called Reinforcement Learning for
Queueing Networks (RL-QN), which applies model-based RL methods over a finite
subset of the state space, while applying a known stabilizing policy for the
rest of the states. We establish that the average queue backlog under RL-QN
with an appropriately constructed subset can be arbitrarily close to the
optimal result. We evaluate RL-QN in dynamic server allocation, routing and
switching problems. Simulation results show that RL-QN minimizes the average
queue backlog effectively.
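The control structure of RL-QN can be sketched in a few lines of Python: apply the learned policy on a bounded region of the state space, and fall back to a known stabilizing policy (MaxWeight here, as a standard example) elsewhere. The threshold and policy stand-ins below are illustrative, not the paper's construction.

```python
# Hedged sketch of the RL-QN control structure described above.

def rl_qn_action(queues, learned_policy, stabilizing_policy, threshold=50):
    """queues: tuple of queue lengths; policies map states to actions."""
    if max(queues) <= threshold:          # state lies in the finite subset
        return learned_policy(queues)     # use the model-based RL policy
    return stabilizing_policy(queues)     # keep the system stable outside

def max_weight(queues):
    """Serve the longest queue: a classic stabilizing policy."""
    return max(range(len(queues)), key=lambda i: queues[i])

# Example with a trivial learned policy that always serves queue 0:
print(rl_qn_action((3, 7), lambda q: 0, max_weight))    # -> 0 (learned)
print(rl_qn_action((3, 99), lambda q: 0, max_weight))   # -> 1 (MaxWeight)
```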
Bias and Extrapolation in Markovian Linear Stochastic Approximation with Constant Stepsizes
We consider Linear Stochastic Approximation (LSA) with a constant stepsize
and Markovian data. Viewing the joint process of the data and LSA iterate as a
time-homogeneous Markov chain, we prove its convergence to a unique limiting
and stationary distribution in Wasserstein distance and establish
non-asymptotic, geometric convergence rates. Furthermore, we show that the bias
vector of this limit admits an infinite series expansion with respect to the
stepsize. Consequently, the bias is proportional to the stepsize up to higher
order terms. This result stands in contrast with LSA under i.i.d. data, for
which the bias vanishes. In the reversible chain setting, we provide a general
characterization of the relationship between the bias and the mixing time of
the Markovian data, establishing that they are roughly proportional to each
other.
While Polyak-Ruppert tail-averaging reduces the variance of the LSA iterates,
it does not affect the bias. The above characterization allows us to show that
the bias can be reduced using Richardson-Romberg extrapolation with $m \ge 2$
stepsizes, which eliminates the $m-1$ leading terms in the bias expansion. This
extrapolation scheme leads to an exponentially smaller bias and an improved
mean squared error, both in theory and empirically. Our results immediately
apply to the Temporal Difference learning algorithm with linear function
approximation, Markovian data, and constant stepsizes.
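As a concrete instance, the following Python sketch applies the two-stepsize Richardson-Romberg combination to constant-stepsize TD(0) with (tabular) linear features on a toy two-state Markov chain; the chain, features, reward, and burn-in choice are illustrative.

```python
import numpy as np

# Two-stepsize Richardson-Romberg sketch: run the same TD(0) recursion with
# stepsizes alpha and 2*alpha, tail-average each iterate sequence, and
# combine as 2*avg_alpha - avg_2alpha to cancel the O(alpha) bias term.

def td0(phi, trajectory, gamma, alpha, burn_in=1000):
    """Constant-stepsize TD(0); returns the tail-averaged weight vector."""
    theta, tail = np.zeros(phi.shape[1]), []
    for t, (s, r, s_next) in enumerate(trajectory):
        td_err = r + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta = theta + alpha * td_err * phi[s]
        if t >= burn_in:
            tail.append(theta)
    return np.mean(tail, axis=0)

rng = np.random.default_rng(2)
P = np.array([[0.9, 0.1], [0.2, 0.8]])       # two-state Markov chain
phi = np.array([[1.0, 0.0], [0.0, 1.0]])     # tabular features
states = [0]
for _ in range(20000):                        # one Markovian sample path
    states.append(rng.choice(2, p=P[states[-1]]))
traj = [(s, 1.0 if s == 0 else 0.0, s2) for s, s2 in zip(states, states[1:])]

alpha = 0.05
theta_a = td0(phi, traj, 0.9, alpha)
theta_2a = td0(phi, traj, 0.9, 2 * alpha)
theta_rr = 2 * theta_a - theta_2a             # Richardson-Romberg combination
```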